Number of Features and Overfitting

Question:

A classic way to overfit a model is to use lots of features with not a lot of training data. You can find the starter code in feature_selection/find_signature.py. Get a decision tree up and running on the training data, and print out the accuracy. How many training points are there, according to the starter code?
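As a rough illustration of the setup described above, the sketch below trains a decision tree on a small number of points with many features and prints the accuracy. It does not use the starter code or its data: the arrays are random stand-ins, and the sizes (100 training points, 2000 features) are assumptions chosen only for the example, so the printed count is not the answer to the quiz.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Hypothetical stand-in data (NOT the course data set):
# few training points, many features, labels that are pure noise.
rng = np.random.RandomState(42)
n_train, n_test, n_features = 100, 500, 2000
features_train = rng.rand(n_train, n_features)
features_test = rng.rand(n_test, n_features)
labels_train = rng.randint(0, 2, size=n_train)
labels_test = rng.randint(0, 2, size=n_test)

# Fit an unconstrained decision tree on the training data.
clf = DecisionTreeClassifier(random_state=0)
clf.fit(features_train, labels_train)

train_acc = accuracy_score(labels_train, clf.predict(features_train))
test_acc = accuracy_score(labels_test, clf.predict(features_test))

print("training points:", len(features_train))
print("training accuracy:", train_acc)
print("test accuracy:", test_acc)
```

Because the labels here are random, a high training accuracy next to a near-chance test accuracy is a sign that the tree has memorized the training points rather than learned anything general.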


INSTRUCTOR NOTE:

Special Note: Depending on when you downloaded find_signature.py, you may need to change lines 9-10 of the code to read

words_file = "../text_learning/your_word_data.pkl"
authors_file = "../text_learning/your_email_authors.pkl"

so that the script loads the files created by running vectorize_text.py.

In addition, if the code fails to run because of memory issues and you are on version 0.16.x of scikit-learn, you can remove the .toarray() call from the line where features_train is created. This saves memory: in that version, the decision tree classifier accepts a sparse array as input, not only dense arrays.
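The following is a minimal sketch of the sparse-input idea, assuming a scikit-learn version whose decision tree accepts sparse matrices. The data is a hypothetical, mostly-zero matrix standing in for vectorized text; it is not the output of the starter code.

```python
import numpy as np
from scipy.sparse import csr_matrix
from sklearn.tree import DecisionTreeClassifier

# Hypothetical mostly-zero feature matrix, similar in shape to TF-IDF output.
rng = np.random.RandomState(0)
dense = rng.rand(50, 200)
dense[dense < 0.9] = 0.0          # keep roughly 10% of the entries
labels = rng.randint(0, 2, size=50)
sparse = csr_matrix(dense)        # stores only the non-zero entries

# Fit once on the sparse matrix and once on its dense copy.
clf_sparse = DecisionTreeClassifier(random_state=0).fit(sparse, labels)
clf_dense = DecisionTreeClassifier(random_state=0).fit(sparse.toarray(), labels)

sparse_acc = clf_sparse.score(sparse, labels)
dense_acc = clf_dense.score(dense, labels)

print("stored values (sparse):", sparse.nnz)
print("stored values (dense):", dense.size)
print("training accuracy (sparse, dense):", sparse_acc, dense_acc)
```

The sparse matrix stores far fewer values than its dense copy, which is where the memory saving from skipping .toarray() comes from.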
